• 1 Pre-session reading
  • 2 Today’s practical
    • 2.1 Part 1 - GLM with categorical predictors
      • 2.1.1 Visualise your data
      • 2.1.2 Code a GLM
        • 2.1.2.1 Factorial vs continuous predictor variables
      • 2.1.3 Assessing the fit of our model
      • 2.1.4 Model summaries and outputs
      • 2.1.5 Multiple comparisons test
    • 2.2 Part 2 - a GLM with our own data!
      • 2.2.1 Data generation
        • 2.2.1.1 Data recording
        • 2.2.1.2 Plan for data generation
        • 2.2.1.3 What have we collected?
      • 2.2.2 Analysing our data
        • 2.2.2.1 Our workflow
        • 2.2.2.2 A priori
        • 2.2.2.3 Reading in your data
        • 2.2.2.4 Understand your data
        • 2.2.2.5 Identifying variables/tidying data
        • 2.2.2.6 Identify potential error structures
        • 2.2.2.7 Visualize your data
        • 2.2.2.8 Fit a model to your data
        • 2.2.2.9 Assess the model fit
        • 2.2.2.10 Update the model
        • 2.2.2.11 Interpreting your model
      • 2.2.3 What does this analysis mean?
    • 2.3 Issues with the analysis?
    • 2.4 To summarise
  • 3 Key Concepts
  • 4 Functions covered today
  • 5 Cheat sheets
  • 6 Extra reading
  • 7 Homework
    • 7.1 Homework solution
    • 7.2 This week’s homework

1 Pre-session reading

Time - 60-120 minutes

1.1 Recap

To briefly recap, last week we:

  • Learned to do basic data visualisations with ggplot() and identify some common plotting issues
  • Learned about geoms and how they can be used to make different sorts of plots
  • Learned how to plot grouped data on the same plot (using different colours, shapes, and line types) and on different plots (facets)
  • Learned how to save your plot
  • Refreshed basic statistics (normal distribution, comparing means, etc.)
  • Learned how to check whether data is normally distributed
  • Thought about how we can analyse normally distributed data
  • Carried out a t-test and interpreted the results
  • Carried out your own statistical analysis!

1.2 ILOs

Today we are going to:

  • Learn what a GLM is, and how it relates to simple statistics like linear regressions
  • Learn what error distributions are
  • Learn what link functions are
  • Learn what to do when we don’t have normally distributed data
  • Learn what error distributions should be considered for count data
  • Learn how to fit a GLM
  • Learn how to visualise residuals (difference between predicted and observed)
  • Learn how to tell if a model is fitting our data well
  • Learn how to explore alternative models
  • Learn how to compare models
  • Learn how to interpret model summaries

1.2.1 GLMs

1.2.1.1 What is a GLM?

Generalised Linear Models (GLMs) are a critical and hugely useful technique for data analysis. They are Generalised Linear Models because they allow you to specify both the error distribution of your model (e.g. Gaussian, Poisson, etc.) and the link function (e.g. identity, log, etc.). This makes them hugely flexible: you can fit them to count data, binary data, and data where the predictor variables are categorical or continuous. Critically, GLMs can be fit to data whether the errors are normally or non-normally distributed. Linear regressions - which I am sure you are all familiar with, or at least aware of - are a special case of the GLM, where the data have a Gaussian (normal) error distribution with an identity link (see below).

1.2.1.2 Why are we studying GLMs?

As we said above, GLMs are fabulously flexible - they can do t-test style analyses with two or more categorical predictors, linear regressions with continuous predictor variables, and mix both categorical and continuous predictors in multiple regression analyses. We will also consider an extension to GLMs which can account for non-independence between samples (so-called mixed models). GLMs are complex beasts, but they are widely used to analyse biological data because such data are often nested and non-normal, so we are going to cover this topic in depth here rather than spread ourselves thinly across many different statistical techniques. We will touch on some other statistics in upcoming lectures.


The following two sections are a quick refresher on error distributions and link functions. Feel free to skip these if you are comfortable with the topics.

1.2.1.3 Error distributions

Last week we talked about normally distributed data, and we learned what that means in terms of visualizing data in a histogram (classic bell-curve). In reality, that was a slight simplification of what we mean when we talk about normally distributed data. What we are really interested in is normally distributed error.

1.2.1.3.1 What do we mean by error?

Well, below we have a histogram of data where the mean is shown with a dashed line. When we run a statistical test we are essentially comparing the mean values of different groups (e.g. height, as we did last week), and looking at how varied the data are around those means to estimate how likely the groups are to be statistically different. This variance around the mean is our “error” - effectively, how wrong the group mean estimated by the model is when compared against all of the observed data. It is this error around the mean that we want to be normally distributed (if we are fitting a model which assumes a normal distribution, like the t-test we did last week).

So, the error distribution specifies the structure of the error in the model - i.e. the variance in the data not explained by the predictor variables (e.g. sex explaining height).

To extend beyond the t-test we looked at last week, let’s take the example of a Gaussian (normal) distribution where our predictor variable isn’t categorical (sex - male or female) but rather continuous. The expectation when fitting a model with a Gaussian (normal) error distribution is that the residual error (the error not explained by the model) is normally distributed around the fitted model:

Here we have a linear model (the blue horizontal line in the Fitted model panel) where a change in the x value has no effect on the y values (i.e. the slope of the blue line is ~0). The data are plotted in light blue, and the residual error (the difference between the predicted values [the blue horizontal line] and the observed values [the light blue dots]) is shown by the grey lines. You can see that the residuals are normally distributed around that fitted blue line (see the histogram of residuals).

In effect, we are trying to minimise the residual error, as less error = a model which fits the data better. So we are looking for a roughly normal distribution of errors in our histogram of the residuals (top right panel above, ☑), with a roughly linear pattern in our fitted vs residual plot (bottom right panel above, ☑).

If we now fit a Gaussian model to count data (e.g. data on the number of a species at a sample site):

You will see that the residuals are very much not normally distributed, because we are working with counts, which are:

  1. integer values (whilst the normal distribution can produce any real value between negative infinity and positive infinity), and
  2. bounded at 0 (you can’t observe <0 birds).

Consequently, fitting a Gaussian model to count data like that in the plots above isn’t appropriate.

Using an error distribution other than the Gaussian allows us to deal with data which are non-normally distributed, or which don’t conform to the assumptions of the normal distribution (e.g. are integers). In the above example we would probably use a Poisson distribution (specifically designed to cope with non-negative integer values).
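To make this concrete, we can simulate some count data and fit both error structures to it (a sketch with made-up data, not the data in the plots above):

```r
## simulate counts whose mean increases with a continuous predictor x
set.seed(42)
x <- runif(200, 0, 2)
counts <- rpois(200, lambda = exp(0.5 + x))

## fit the same model with a Gaussian and with a Poisson error structure
mod_gaus <- glm(counts ~ x, family = "gaussian")
mod_pois <- glm(counts ~ x, family = "poisson")

## compare the residual histograms - the Poisson fit is designed
## for non-negative integer data like these
hist(resid(mod_gaus))
hist(resid(mod_pois))
```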

For more information on the error families available for GLMs, have a look at the help - ?family.

2 Today’s practical

2.1 Part 1 - GLM with categorical predictors

Time - 60 minutes

So, let’s think about fitting a GLM to data. We are going to work on a few different data sets to show how GLMs can be used in different scenarios (and how flexible they are). First off we are going to briefly go through an example using a GLM to carry out an ANOVA style analysis (looking to see whether there are differences between the mean values of different groups in data). This is conceptually very similar to the t-test you did last week, but with 3 groups rather than comparing a pair of groups. We will then deep-dive into a more complex example later on.

For our simple first example we will analyse data which we worked with last week, the Palmer Penguins flipper lengths:

##load the package
library(palmerpenguins)
##load the penguins data set
data(penguins)

Take a look at the structure of the data. Try and understand what might be response variables and what might be predictor variables. Discuss this in your groups.

2.1.1 Visualise your data

Right, importantly we need to visualize our data. This is critical to understanding the structure of our data, and thus how the data should be analysed.


Task

Plot the flipper length data grouped (split) by species so we can eyeball whether there might be differences between the three species in the data. In the below plot I have “jittered” the data, which means we have artificially added variance to the data on the x-axis so we can see the points more easily. See if you can do this in your plot (hint - geom_ is your friend!)


## Plot the flipper length between the different species
ggplot(penguins, aes(x=species, y=flipper_length_mm)) + 
  geom_jitter(aes(col=species)) +
  theme_bw()

Great, so it looks like there might be some differences between the species (particularly between Gentoo and the other two species). We want to test if these differences are down to chance (e.g. an artifact of sampling), or whether probability suggests that these differences are really there between the species. To do this we want to compare the variance and means of the flipper_length_mm between the three species.

However, as we have plotted it above it is quite hard to discern what the distributions of the data are (e.g. are they normally distributed?), so we can do a better job at visualising the data.

Histograms are particularly useful for seeing whether data are normally distributed, and are (luckily) easy to do with ggplot.


Task

Tweak the following code to make a plot like the one below:
ggplot(..., aes(x=..., fill = species)) +
  ## bin width determines how coarse the histogram is
  ## the alpha determines the transparency of the bars
  ## position allows you to determine what kind of histogram you plot (e.g. stacked vs overlapping). try changing to position="stack"
  geom_histogram(binwidth = 1, alpha = .5, position="identity")

ggplot(penguins, aes(x=flipper_length_mm, fill=species)) +
  geom_histogram(binwidth=1, alpha=.5, position="identity")

So from the above we can visualise:

  1. how normally distributed the three sets of data are (the good news is they look fairly normal)
  2. approximately how overlapping they are, and thus how likely they are to have statistically different means. This gives us a good mental check in case we get something unexpected in our statistics later.

We can also try and make the qq plots we did last week, to check whether the data grouped by species appear normally distributed.


Task

Produce the below plots, adapting the code we used to do this last week.
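If you get stuck, one possible approach is to use ggplot’s built-in qq geoms rather than the base-R version from last week (a sketch, not the only way to do it):

```r
library(ggplot2)
library(palmerpenguins)

## qq plot of flipper length, one panel per species -
## points falling close to the line suggest normally distributed data
ggplot(penguins, aes(sample = flipper_length_mm, colour = species)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~species) +
  theme_bw()
```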


2.1.2 Code a GLM

So we want to test whether there is any significant difference between the mean flipper lengths of the three species. This means we have a continuous response (y) variable, and a categorical predictor (x) variable. This makes the coding for our GLM nice and simple. We have visually checked the data and there don’t appear to be any obvious issues with it, so we can forge ahead.

GLM is a base function in R, so we don’t need to load any packages. To run a GLM:

##fit a glm()
mod_flipper <- glm(flipper_length_mm ~ species,
            ##specify the data
            data = penguins,
            ##specify the error structure
            family = "gaussian")

Here we specify the formula of the model using ‘~’, where the first term is the dependent (y/response) variable, and the following term/terms is/are the independent (x/predictor) variables - i.e. the ones we believe should influence the value of y, but which aren’t influenced by the value of y. So flipper_length_mm is our dependent variable, and species is our independent variable.

Hint - you can read ~ in R as “as a function of”. So in this case, we are testing whether flipper_length_mm changes as a function of species

We have also specified the “family” to be “gaussian” (we didn’t actually need to do this, as it is the default setting, but it’s good practice to make sure we understand exactly what we are doing). Family here means the error structure, so we are assuming (initially) that our residuals are normally distributed. Given our current settings, this means we are actually just fitting an ANOVA* to the data.

*an ANOVA is like a t-test but with more than 2 groups.

2.1.2.1 Factorial vs continuous predictor variables

R will make some assumptions about what kind of predictor variables you want to use based on the type of data the predictor variables are stored as in the data frame. For example, if the predictor variable is numeric then R will assume it is a continuous predictor variable, i.e. one which could take any value (whole or not). If the data is a character or logical vector, then R will assume it is a factorial predictor variable, and that it can only take the values observed in the vector.

Why does this matter? Well, we can use a model fit in R that has a continuous variable to predict what the response variable should be at values of the continuous variable we haven’t yet seen. For example, in the penguin data there are data on bill length and bill depth. The longest measured bill length is 59.6mm, and this corresponds to a bill depth of 17mm. We might want to know what we would expect the bill depth to be when the length is 65mm. We haven’t observed this, but we could predict it from a model we have fit to these two variables (assuming there is a correlation between the two!).

We can’t do this for factorial predictors, as we have no expectation of what a new factor level would do to the relationship (e.g. we can’t predict the mean flipper length of a new species, as we have no idea what that looks like).

We can force R to treat a numeric predictor as a factorial predictor in our model (which we might want to do if something has been coded numerically but isn’t really continuous) by:

x ~ as.factor(y)
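For example, year in the penguins data is stored as a number; if we wanted to treat it as a category rather than as a continuous trend, we could write (a sketch, not part of today’s analysis):

```r
library(palmerpenguins)

## year is numeric, so R would treat it as continuous by default
class(penguins$year)

## force R to treat year as a factor (categorical) predictor
mod_year <- glm(flipper_length_mm ~ as.factor(year),
                data = penguins,
                family = "gaussian")
```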

2.1.3 Assessing the fit of our model

Statistics is as much art as science, and assessing how well a model fits data can be complex and time-consuming. There isn’t an automated or single best approach for doing it.

Remember:


Luckily R has some nifty built-in functions to help us. We will start off using some of the base methods, and then move on to some more complex approaches later today and next week.

Once we have coded and run the model we have an object (above called mod_flipper):

##display the class of the model object
class(mod_flipper)
## [1] "glm" "lm"

You can see this is a special class, a GLM and an LM (LM just means linear model). There are some basic functions for assessing the fits of GLMs in R (try running plot(mod_flipper)), but - once again - packages can come to our rescue here, specifically the excellent performance package.
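If you do try the base plots, it helps to see all four diagnostic plots in one window:

```r
## arrange the four base diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(mod_flipper)
## reset the plotting layout afterwards
par(mfrow = c(1, 1))
```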


Task

Install the package and run the check_model function. See here for vignette.

check_model() helpfully tells us how we should interpret the output above - and to me (at least) this all looks fine.


Note - For a more in depth discussion of how to interpret the above diagnostic plots see the excellent blog post here.


We will go deeper into assessing model fits later in this session and next week, but for now it looks like this simple model fits the data well, and we can be pretty confident in it. In fact, this is about as good a fit as you are going to get when fitting a model to real data!

2.1.4 Model summaries and outputs

So we have fit our model, and ensured that our model fits the data well. Now we want to look at what our model says about the data.

First off let’s look at the model output, using the summary() function:

##summarise the model outputs
summary(mod_flipper)
## 
## Call:
## glm(formula = flipper_length_mm ~ species, family = "gaussian", 
##     data = penguins)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -17.9536   -4.8235    0.0464    4.8130   20.0464  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      189.9536     0.5405 351.454  < 2e-16 ***
## speciesChinstrap   5.8699     0.9699   6.052 3.79e-09 ***
## speciesGentoo     27.2333     0.8067  33.760  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 44.1099)
## 
##     Null deviance: 67427  on 341  degrees of freedom
## Residual deviance: 14953  on 339  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 2270.6
## 
## Number of Fisher Scoring iterations: 2

You can see this gives us further details on the model. The important bits for us at this stage are the coefficients.

Note - for a full discussion of the output see the excellent posts here and here.

We can see there is a significant positive effect of both speciesChinstrap and speciesGentoo. But what is this measured against? And where is the third species, Adelie?! Well, R takes the first level of our predictor variable (alphabetically first, in our case, as these are categorical predictors) and uses this as our comparison point. This is called the intercept in the model output.

So, to read our output table:

The estimate value for the intercept is the estimate of the mean value of the flipper_length_mm for our intercept species (Adelie) - 189.9536. If we look at our first plot (and here I’ve added a crossed circle at 189.95):

We can see that that looks about right.

Our values for our other two species are relative to this mean. So to calculate the mean value for Chinstrap we need to add its Estimate (from the above table) to the Intercept:

189.9536 + 5.8699
## [1] 195.8235

And the same for Gentoo:

189.9536 + 27.2333
## [1] 217.1869

Again, eyeballing the graph above corroborates these estimated mean values for each species. Great!
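We can also corroborate these values without eyeballing, by calculating the observed group means directly (assuming the penguins data is still loaded):

```r
## observed mean flipper length per species, ignoring missing values -
## these should match the intercept and intercept + estimate values above
tapply(penguins$flipper_length_mm, penguins$species, mean, na.rm = TRUE)
```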

We can also see there is a significant difference in all three values in our table (all values in the Pr(>|t|) column are <0.05 - i.e. there is a less than 5% probability this pattern occurred by chance).

For the Intercept value, we interpret this as the intercept Estimate being significantly different from 0 (not very important in our case).

For the other two values, our interpretation is that they are significantly different from our intercept (the mean value for Adelie).


Question

Is the mean value of Chinstrap significantly different from the mean value for Gentoo?






The simple answer to the above question is…we don’t know. Or at least, we can’t tell from this analysis. All we are doing here is testing for differences from the intercept (Adelie). To discern whether there are differences between the other species we need to do some further analysis.

2.1.5 Multiple comparisons test

To find out whether there are significant differences between our groups we need to do a multiple comparisons test (i.e. compare the differences between all of the categories in our predictor variable). Luckily, this is easy to do in R:

## load the multcomp pack
library(multcomp)

## run the multiple comparisons, and look at the summary output:
summary(glht(mod_flipper, mcp(species="Tukey")))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: glm(formula = flipper_length_mm ~ species, family = "gaussian", 
##     data = penguins)
## 
## Linear Hypotheses:
##                         Estimate Std. Error z value Pr(>|z|)    
## Chinstrap - Adelie == 0   5.8699     0.9699   6.052 2.75e-09 ***
## Gentoo - Adelie == 0     27.2333     0.8067  33.760  < 1e-09 ***
## Gentoo - Chinstrap == 0  21.3635     1.0036  21.286  < 1e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Question

Using your newly acquired knowledge of how to read model summaries in R: which of the species are significantly different from one another?






2.2 Part 2 - a GLM with our own data!

Ok, so you’ve done a simple GLM. Now it’s time to get a bit more complicated, and to do this we are going to generate a data set (as a class) and then analyse it together.

2.2.1 Data generation

Time - 20 minutes

I will give an introductory talk on this section so we are all up to speed.

Once you have had this talk you have 20 minutes to go away and create your data, and add it to the spreadsheet.

2.2.1.1 Data recording

Each group needs to make a spreadsheet with the following columns (and in this order, so we can combine them all together).

Group name       Thrower         Attempt  Species  Bucket  Distance  Success
Something funny  Chris Clements  1        blue     1       1.5       0
Something funny  Chris Clements  2        red      1       5         1
Something funny  Chris Clements  3        green    5       0.5       1

Things to consider:

  • To make things as simple as possible please ensure all the ball colours are spelled correctly, and have no capitals or spaces! i.e.: “lightpink” not “Light Pink” etc.

    • yellow
    • red
    • darkpink
    • lightpink
    • purple
    • lightblue
    • darkblue
    • teal
    • green
    • orange
  • Keep your group name and your name consistent (copy and paste it into new cells). Put your first AND last name into the thrower column.

  • For bucket volumes enter either 1 (for the small) or 5 (for the large). Nothing else!

  • For distances enter the distance in meters (so 50cm is entered as 0.5). No need to specify units (i.e. don’t put “.50m”)

  • Success can be 0 or 1 (it either stayed in the bucket or did not), nothing else.

Use this template to save your data:

Download Blank data table.xlsx

2.2.1.2 Plan for data generation

I’ve talked over the plan, but in case you need to check anything:

  1. Split into your groups
  2. Each group has 2 buckets of 2 different sizes
    • 1L (entered as “1” in Bucket column)
    • 5L (entered as “5” in Bucket column)
  3. Mark out a throwing line
  4. Decide on 4 distances you want to place the buckets at (e.g. 75cm, 130cm, 220cm, 300cm - you can choose), between a minimum of 50cm and a maximum of 3 meters. Measure these from your throwing line and place one bucket at each of these distances. You can decide which buckets you want to put at each distance.
  5. Take it in turns to throw 10 balls and try and get them into the buckets. Randomise which color ball you are throwing by blindly picking them out of the bag/tub, and randomize the bucket you are aiming for with each throw
    • try the sample() function in R
  6. Each person will have 10 throws
    • Easiest to do these 10 throws in a row. It will make data recording easier and counting which attempt you are on easier too
  7. For each attempt by each person record:
    • The thrower
    • Which attempt this is (count up from 1 to 10)
    • The color of the ball (the “Species”)
    • The bucket size
    • The distance from the throwing line
    • Whether the throw was successful or not
  8. Once you have all thrown and saved the data in your data file, email it to me: c.clements@bristol.ac.uk

2.2.1.3 What have we collected?

Once you finish the data collection spend some time thinking about the following questions:

  1. What are our predictor variables?
  2. What are our response variables?
  3. What response variables could we calculate from the variables we have?
    • Think specifically about IBT
  4. What distribution are our response variables likely to be?
    • Gaussian? Poisson?
  5. What are our a priori predictions?
    • What is our null hypothesis?

We will discuss these in class.

2.2.2 Analysing our data

Time - 85 minutes

2.2.2.1 Our workflow

In your pre-session reading we covered a workflow for data analysis:

  1. Read in your data
  2. Understand the structure of your data
  3. Consider the variables in your data (predictor vs. response), and identify potential error structures.
  4. Sort your data and apply any corrections/calculations
  5. Visualize your data
  6. Fit a model to your data
  7. Assess the fit of this model to your data
  8. Update the model (if necessary)
  9. Assess the fit of this model to your data (repeating steps 8 and 9 as necessary)
  10. Interpret the model results

We are now going to do this with the data we have just generated. Below under each of the headings are some tips/hints on functions/code which might be useful for each of the above stages.


Task

As a group, work through each of the sections below using all the data we have just collected across the groups. Write your code in an accessible way so that you all understand what you have done and why.

2.2.2.2 A priori

  • What are our a priori expectations? What are our predictions?
    • As a group, draw your expectations - this will help you formalize what patterns you are expecting to see, and will also help you decide what you might need to put into your model as predictor variables.
  • What is the null hypothesis we are testing?

2.2.2.3 Reading in your data

  • Useful functions: vroom()
  • Think about how we read in online data sets in practical 3 (go back and have a look if you need to)

2.2.2.4 Understand your data

  • str(), head(), tail()
  • get a sense for what your data look like
    • hopefully you will have a good idea of this given you generated the data! But take a look at the other groups’ data too

2.2.2.5 Identifying variables/tidying data

  • What variables are important here?
  • What are predictor variables and what are response variables?
  • Do we need to transform the data to calculate our response variable?
    • We are going to use number of species as a response variable, so how do we calculate this from our data set?
  • summarise(), unique(), length(), group_by() might all be useful
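As a hint for the number-of-species calculation, here is a dplyr sketch using a tiny made-up version of the class spreadsheet (the column names follow the data-recording table above; counting only successful throws is an assumption for you to discuss):

```r
library(dplyr)

## a tiny made-up example of the class data
throws <- data.frame(
  Thrower  = c("A B", "A B", "A B", "C D"),
  Species  = c("red", "green", "red", "teal"),
  Bucket   = c(1, 1, 5, 5),
  Distance = c(0.5, 0.5, 1.5, 1.5),
  Success  = c(1, 1, 0, 1)
)

## count the number of distinct ball colours ("species") that
## successfully landed in each bucket size/distance combination
throws %>%
  filter(Success == 1) %>%
  group_by(Bucket, Distance) %>%
  summarise(n_species = length(unique(Species)), .groups = "drop")
```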

2.2.2.6 Identify potential error structures

  • What sort of data do we have? Number of species.
  • Is this continuous?
  • Is a Gaussian distribution appropriate? (always start here, as a Gaussian is the easiest to fit!)
  • If Gaussian isn’t appropriate what might be?

The fitdistrplus package can help you here. It allows you to plot your data against various distributions (you can specify) to see whether your data approximately fit the expected distribution:

##load the fitdistrplus library
library('fitdistrplus')

##plot the fit of the observed values against a specified distribution
##what might the distribution be for these data?
plot(fitdist(observed_values,"?"))
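For instance, with simulated counts (so you can see what a reasonable fit looks like before trying your own data; the numbers here are made up):

```r
library(fitdistrplus)

## simulate some counts and check them against a Poisson distribution
set.seed(1)
sim_counts <- rpois(100, lambda = 3)
plot(fitdist(sim_counts, "pois"))
```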

2.2.2.7 Visualize your data

  • Critical step! What do the data look like?
  • What plots should you make?
    • Think about trying to plot your hypothesis
    • i.e. if we expected temperature to influence metabolic rate, we would plot temperature on the x-axis and metabolic rate on the y-axis
  • ggplot(), geom_point(), stat_smooth(), facet_wrap()

2.2.2.8 Fit a model to your data

  • We have a hypothesis, now code that as a model
    • for the above example we would have metabolic_rate ~ temperature
    • try writing out the formula based on your expectations
    • Do we want to consider more than one predictor variable?
    • Do we want to consider interactions between variables?
      • remember - If we want to include an interaction then we need to include all of the individual effects too (see example below)
  • We are going to use a GLM to analyse this data, so glm()
## an example glm with an interaction and two predictor variables
## note - both of the terms in the interaction are included separately too
mod1 <- glm(metabolic_rate ~ temperature + treatment + temperature:treatment,
            data = example_data,
            family = "gaussian")
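Note that R has a shorthand for this: a * b expands to a + b + a:b, so the model above could equally be written with *. A self-contained sketch (the data here are simulated purely for illustration):

```r
## simulate some illustrative data matching the example above
set.seed(1)
example_data <- data.frame(
  temperature    = runif(60, 10, 30),
  treatment      = rep(c("control", "warmed"), each = 30),
  metabolic_rate = rnorm(60, mean = 5)
)

## temperature * treatment is shorthand for
## temperature + treatment + temperature:treatment
mod1 <- glm(metabolic_rate ~ temperature * treatment,
            data = example_data,
            family = "gaussian")

## four coefficients: intercept, both main effects, and the interaction
names(coef(mod1))
```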

2.2.2.9 Assess the model fit

We did a brief assessment of the model fit in the first example, but in reality we will want to spend more time and energy digging into how well our model really fits the data. This is especially important as models become more complex.

Luckily, the performance package massively helps with this, as - depending on what model you have fit (e.g. Gaussian, etc) - it will update the diagnostic tools to help you understand how well the model is doing. What do these tests show? Can you come up with solutions to these issues?


ONE THING TO NOTE - you are never going to get a model which perfectly fits your data! (hence the “all models are wrong…”). We need a model which fits as well as we can manage, so that any interpretations of the data are (with the highest likelihood) correct.


Below are some sample code chunks to help us assess our model fits

2.2.2.9.1 Observed vs predicted

A really simple first-pass way of assessing how well your model fits your data is simply to plot the predicted values against the observed values. If your model fits well, the relationship between the observed and predicted values should be approximately one to one.

##visualize the fitted vs observed values
fit_data <- data.frame("predicted"= your_model$fitted.values,
                       "observed" = your_data$response_variable)
##plot them
ggplot(fit_data, aes(x=observed, y=predicted)) + 
  geom_point() + 
  ##add a 1:1 line
  geom_abline(intercept = 0) +
  ##add a linear regression to your data
  geom_smooth(method="lm",
              col="lightblue",
              se = F) +
  ##set the theme
  theme_classic()

Question

So, how well does your model fit your data?


2.2.2.9.2 Look at the model summary

The summary of your model can tell us something about how our model fits.


Task

Use Google to try and understand the difference between Null Deviance and Residual Deviance, and then look at the summary of your model and apply what you have just learned.


summary(your_model)
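Once you understand the two deviances, one quick number you can derive from them is the proportion of deviance explained (a rough pseudo-R²; sketched here with the penguin model from Part 1):

```r
## proportion of the null deviance explained by the predictors -
## closer to 1 means the model explains more of the variation
1 - (mod_flipper$deviance / mod_flipper$null.deviance)
```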
2.2.2.9.3 More advanced techniques

There are also a suite of more advanced techniques which include simulating data based on your model fits - we will cover these next week.

2.2.2.10 Update the model

So, the next thing to do is fit some alternative models to the data to see if they do a better job. We might explore this even if the model looks like a pretty good fit (remember we are trying to minimise the residual error in the above plots - a model with less residual error fits our data better).

There might be two reasons for doing this:

  1. you might want to assess different error structures
  2. you might want to check whether the model fits better with a different structure (i.e. by including or dropping predictor variables)

Question

Should we consider reason (2) for our current analysis?


In the case of reason (1), the data we are working with are counts (the number of species at each distance and bucket size), so what options do we have other than the Poisson distribution?


Task

Google other options for analysing count data with a glm() in R.


We aren’t going to fit more models to this data today, as we don’t have time, but we will come back to this next week and think about alternative models, how to fit them, and how to select between them.

You would then assess the fit of the updated model, update it again, and so on - cycling through these steps until you settle on a final model.

2.2.2.11 Interpreting your model

So, we have a final model. Now we need to interpret it! To do this we need to look at what the model output is - using the summary() function:

##summarise the model outputs
summary(best_mod)

Question

Discuss among your group - can you work out what the summary table is telling you? How does this relate to your null hypotheses? Do your predictor variables have a significant effect?


2.2.2.11.1 Visualising your model

We may want to simply report the summary statistics from the model (see Interpreting your model), however it is often more powerful (remember - we are visual creatures) to plot the output of your model to visualise what the statistics are telling us.

Luckily (as with most things in R) someone has developed a couple of excellent packages to help us with this. Install and load the following packages (you will likely also need to install some additional dependencies - read the error messages!):

library(jtools)
library(interactions)

Then try and get to grips with the following functions to visualise your models:

plot_summs()

interact_plot()

Question

What do each of these functions allow you to do? Are both useful? Which output is easier to interpret?


Hint - try these excellent help pages on the packages here and here.
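To get you started, here is a hedged sketch of how the two functions might be called. The dataset and model below are simulated placeholders standing in for your own best_mod and variables:

```r
library(jtools)        ## for plot_summs()
library(interactions)  ## for interact_plot()

## a small illustrative model (placeholder for your own best_mod)
set.seed(7)
sim <- data.frame(distance = runif(60, 0, 10),
                  size = factor(rep(c("small", "large"), each = 30)))
sim$count <- rpois(60, lambda = exp(2 - 0.1 * sim$distance +
                                      0.3 * (sim$size == "large")))
mod <- glm(count ~ distance * size, data = sim, family = poisson)

## coefficient plot: point estimates with confidence intervals
plot_summs(mod)

## predicted counts across distance (pred), one line per level of size (modx)
interact_plot(mod, pred = distance, modx = size)
```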

2.2.3 What does this analysis mean?

So, we have some results, but what do they mean in the context of our hypotheses and the data we collected? Remember why we do analyses - to test hypotheses, or to look for patterns in data, but most of all to answer a question (be that whether there is higher mortality in people who go sky diving than in those who go scuba diving, or whether there are more species on closer, larger islands). Without relating our results and conclusions back to the questions we are asking, and putting them in the context of current knowledge in the field, our findings become meaningless.

2.3 Issues with the analysis?

Finally, let’s spend a bit of time thinking about the following statement, and how it might relate to our model:




Question

What have we missed?


Consider things we might have failed to account for in our model (e.g. what processes driving the structure of the data have we not explicitly accounted for?). Is it feasible to incorporate these things into the analysis? If yes, should we do this? If not, does this mean we shouldn’t trust our model output?

2.4 To summarise

So, now you know how to fit a simple GLM, and a more complex GLM with multiple predictors. You know how to assess the fit of this model (in a basic way) and visualise the output. Next, we will work on a single population time series from the species data you plotted for homework last week, and see whether there has been a significant increase, decrease, or no change in that population over time.

We will go through the process of fitting a GLM to the data, checking the fit and other potential models, and then think about plotting the outputs of the model.

3 Key Concepts

Let’s quickly summarise what we have learned today:

  • What is a GLM, and how does it relate to simple statistics like linear regression
  • What are error distributions
  • What are link functions
  • What do we do when we don’t have normally distributed data
  • What error distributions should we consider for count data
  • How to fit a GLM
  • How to visualise residuals (difference between predicted and observed)
  • How to tell if a model is fitting our data well
  • How to explore alternative models
  • How to compare models
  • How to interpret model summaries
  • How to plot your model output (briefly, more on this next week)

That’s a lot to cover in one week. Don’t worry if you didn’t manage to work through everything or found some parts difficult - you can always come back to this and refer to it at any time.

4 Functions covered today

Function/operator                        Call
Run a GLM model                          glm()
Return predicted values from a model     predict()
Return the residuals of a GLM            resid()
Compare models using AICc                AICc()
Summarise a model                        summary()
Order the items in a column              order()

5 Cheat sheets

ggplot

7 Homework

7.1 Homework solution

There isn’t a right or wrong way to do last week’s homework, but you will have got feedback on some of the good and less good points from your Rmarkdown.

##Load in the penguin data
library(palmerpenguins)

##load the tidyverse
library(tidyverse)

##load rstatix, and ggpubr (ggqqplot() and theme_classic2() come from ggpubr)
library(rstatix)
library(ggpubr)

##load the penguin data
data(penguins)

##look at the data
head(penguins)
str(penguins)

##filter for gentoo and adelie (use %in%, not ==, which recycles and silently drops rows)
G_and_A <- penguins %>%
  filter(species %in% c("Gentoo", "Adelie"))

##plot the data
ggplot(G_and_A, aes(x=species, y=flipper_length_mm, col=species)) + 
  geom_boxplot() +
  geom_jitter(width=0.15) +
  theme_classic2()

##check for normality
ggqqplot(G_and_A, x="flipper_length_mm", color="species")

##and histogram
ggplot(G_and_A, aes(x=flipper_length_mm, fill=species)) +
  geom_histogram(alpha=0.6, position = 'identity') +
  theme_classic2()

##carry out t-test
pengs_T <- G_and_A %>%
  ##reset levels of the factor
  mutate(species = as.character(species)) %>%
  ##drop any NA values
  drop_na() %>%
  ##also tell it to return a more detailed output
  t_test(flipper_length_mm ~ species, detailed = TRUE) %>%
  ##return the significances too
  add_significance()

##look at the results
pengs_T

7.2 This week’s homework

So we have learned how to fit a GLM today, and plot the output. Let’s apply this knowledge and extend our penguin analysis from earlier. So, for homework write a short Rmarkdown document (submit as a html) to:

  • Load in the Penguin data
  • Fit an appropriate GLM (evidence why your model/distribution is appropriate) which tests whether flipper length differs significantly between species (as we did earlier today), but also include sex as a predictor variable
  • Plot the coefficients
  • Plot the interactions
  • Display the summary table and write 2 sentences to interpret the results